Technical Assessment

This file contains the data analysis as requested. I also developed a Dashboard in order to better display and interact with Outvio’s databases.

Regarding the data analysis, the first step is to load the required packages.

# Loading Packages

library('tidyverse') # Package for data manipulation
library('lubridate') # Manipulate date variables 
library('plotly') # Package for data visualization

After that, it is time to read and import the data. After import databases using R, it was necessary to fix some variables types and create useful features.

# Reading data ----

# Packages - Reading CSV data

packages <- read.csv('data/packages.csv',  na.strings=c("NA","NaN", ""))

# Products - Reading CSV data

products <- read.csv('data/products.csv',  na.strings=c("NA","NaN", ""))

# Shipments - Read CSV data and fix date formats

shipments <- read.csv('data/shipments.csv', na.strings=c("NA","NaN", "")) %>%
  mutate(createdAt = lubridate::date(createdAt),
         deliverDate = lubridate::date(deliverDate),
         estimatedDeliverDate = lubridate::date(estimatedDeliverDate),
         pickupDate = lubridate::date(pickupDate),
         processDate = lubridate::date(processDate),
         deliveryTime = as.numeric(difftime(deliverDate, createdAt, units = "days")), # Create DeliveryTime as the difference in days between createdAt and deliverDate
         predictedDiff = as.numeric(difftime(estimatedDeliverDate, deliverDate, units = "days")), # Time difference between predicted and delivered dates
         delayed = ifelse(deliverDate > estimatedDeliverDate, 'Delayed', 'On Time')) # Create a variable that classifies delivered orders into Delayed or On Time

Exploratory analysis

Taking a look at some informations to better know data, look for missing data and outliers.

Looking for missing values

It is important to ensure data quality before proceeding data analysis. There is some missing columns on all datasets, and would be interesting to check with the team responsible for those informations if it is ok. Since I don’t have full context about data origin and tracking I won’t remove rows based on missing values, as long as crucial information about shipiments, packages and products (X_id, packages, products) are present.

Some packages don’t present the package code, so woun’t be possible to join those with the products data.

#Shipments missing values per column

sapply(shipments, function(y) sum(length(which(is.na(y))))) %>% as.data.frame()
#Packages missing values per column

sapply(packages, function(y) sum(length(which(is.na(y))))) %>% as.data.frame()
#Products missing values per column

sapply(products, function(y) sum(length(which(is.na(y))))) %>% as.data.frame()

Available Data Range

Apparently, mostly of the shipments present on the database were created at March 25. Taking a first look, the deliver time seems to be under control, since the major part of those shipments were delivered at March 27.

Would be interesting to deep dive those number in order to understand how couriers and methods impact the deliver time.

# Created shipments per day

shipments %>% 
  group_by(createdAt) %>%
  summarise(count = n_distinct(X_id)) %>%
  plot_ly(y = ~count, 
          x = ~createdAt, 
          type = 'bar') %>%
  layout(xaxis = list(title = 'Creation Date'),
         yxis = list(title = '# of orders'),
         title = '# of orders per creation date')
# Delivered shipments per day

shipments %>% 
  group_by(deliverDate) %>%
  summarise(count = n_distinct(X_id)) %>%
  plot_ly(y = ~count, 
          x = ~deliverDate, 
          type = 'bar') %>%
  layout(xaxis = list(title = 'Deliver Date'),
         yxis = list(title = '# of orders'),
         title = '# of orders per deliver date')

Packages weight distribution

Almost all packages weights less than 50 weight unities. It is possible to see some outliers weighting 200 weight unities.

# Created shipments per day

packages %>% 
  plot_ly(x = ~weight,
          type = 'histogram')

Minimum requirements

This section presents the resolution of the minimum requirements of this task.

1. Average delivery time per courier

DeuschePost presents the highest deliveryTime, followed by transaher and fedex.

shipments %>% 
  filter(!is.na(deliverDate)) %>% # Remove not delivered orders
  group_by(courier) %>%
  summarise(mean = mean(deliveryTime)) %>%
  arrange(desc(mean))

2. Average delivery time per shipping method

“dhl express - gpt - priority (packet tracked)” presents the highest deliveryTime.

shipments %>% 
  filter(!is.na(deliverDate)) %>% # Remove not delivered orders
  group_by(courier) %>%
  summarise(mean = mean(deliveryTime)) %>%
  arrange(desc(mean))

3. Average products per order

Assuming that 1 order is equal to one shipment, in average, each order presents 2.85 products.

shipments_aux <- shipments %>%
  separate_rows(packages) %>% 
  filter(!packages %in% c('oid', ':', '')) # Unnest and create one row per package

packages_aux <- packages %>% 
  separate_rows(products) %>%
  filter(!products %in% c('oid', ':', '')) # Unnest and create on row per product

products_per_order <- shipments_aux %>%
  inner_join(packages_aux, by = c('packages' = 'X_id')) %>% # Join Shipments and Packages data
  group_by(X_id) %>%
  summarise(products = n_distinct(products)) %>% # Count # of products whithin each order
  ungroup() %>%
  summarise(products_per_order = mean(products))

c('In average, each order presents 2.85 products.')
## [1] "In average, each order presents 2.85 products."

Aditional Analysis

In order to better explore the provided data bases, I formulate some questions to be answered through data analysis.

2. How accurate is the delivery prediction?

Delayed vs On Time orderds

Comparing the effective deliver time with the predicted, 16% of the orders have been delayed through the analyze time period.

shipments %>%
  mutate(const = 'Delayed?') %>%
  filter(!is.na(delayed)) %>% # Remove not delivered orders
  group_by(delayed, const) %>%
  summarise(count = n_distinct(X_id)) %>%
  ungroup() %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  plot_ly(y = ~perc, 
          x = ~const,
          color = ~delayed,
          type = 'bar') %>%
  layout(barmode = 'stack') %>%
  layout(yaxis = list(title = 'Percentage'),
         xaxis = list(title = ''),
         title = 'Percentage of delayed orders')

Delayed vs On Time orders per Courier

The plot is comparing On Time delivered orders versus Delayed orders per courier. Envialia presents a difficult situation, since it is the most popular courier and, regarding the main couriers, is the one that presents more delayed orders.

shipments %>% 
  filter(!is.na(delayed)) %>% # Remove not delivered orders
  group_by(delayed, courier) %>%
  summarise(count = n_distinct(X_id),
            predictedDiff = mean(predictedDiff)) %>%
  ungroup() %>%
  group_by(courier) %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  ungroup() %>%
  plot_ly(y = ~count, 
          x = ~reorder(courier, desc(count)),
          color = ~delayed,
          type = 'bar') %>%
  layout(barmode = 'stack') %>%
  layout(yaxis = list(title = '# of Orders'),
         xaxis = list(title = 'Courier'),
         title = 'Delayed vs On Time orders per Courier')

Delayed vs On Time orders per Courier (Percentage)

The plot is comparing On Time delivered orders versus Delayed orders per courier. Envialia presents a difficult situation, since it is the most popular courier and, regarding the main couriers, is the one that presents more delayed orders.

shipments %>% 
  filter(!is.na(delayed)) %>% # Remove not delivered orders
  group_by(delayed, courier) %>%
  summarise(count = n_distinct(X_id),
            predictedDiff = mean(predictedDiff)) %>%
  ungroup() %>%
  group_by(courier) %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  ungroup() %>%
  plot_ly(y = ~perc, 
          x = ~reorder(courier, desc(count)),
          color = ~delayed,
          type = 'bar') %>%
  layout(barmode = 'stack')%>%
  layout(yaxis = list(title = 'Percentage of orders'),
         xaxis = list(title = 'Courier'),
         title = 'Percentage of Delayed vs On Time orders per Courier')

How many days the delayed orders took to arrive?

60% of the delayed orders were delayed by only one day. Almost 90% of the delayed orders were delayed by a maximum of 5 days.

shipments %>% 
  filter(delayed == 'Delayed') %>% # Remove not delivered orders
  group_by(predictedDiff) %>%
  summarise(count = n_distinct(X_id)) %>%
  ungroup() %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  plot_ly(x = ~predictedDiff, 
          y = ~perc, 
          type = 'bar') %>%
  layout(yaxis = list(title = 'Percentage of Orders'),
         xaxis = list(title = 'Days of delay'),
         title = 'Delayed days per delayed Packages - Predicted Deliver Date - Deliver Date')

3. How can we compare couriers regarding performance?

Courier’s delivery time vs number of orders

One option to estimate which are the top performers courier is to look for the orders volume of delivers versus delivery Time.

The following graph displays each Courier based on deliverTime and volume of orders. The lines represents the average order per courier and the average deliverTime.

Interpreting: Couriers at the top-left have more orders than average and lower deliveryTime, what could represent a good performance. On the other hand, couriers at the top-right presents high volume of orders but a deliveryTime above the average.

The top-right courier is ups. Since it’s a courier that has a high number of orders, it would be important to understand what is causing that increase on time.

Also interesting is the fact that envialia is the main Courier in terms of number of orders. But, taking into account only On Time orders, mrw presents the highest absolut number of orders.

# Found the average number of orders and the average deliver Time per courier 

avg_order <- shipments %>%
  filter(!is.na(deliverDate) & delayed == 'On Time') %>% # Remove not delivered orders  or delayed orders
  group_by(courier) %>%
  summarise(volume = n_distinct(X_id),
            deliveryTime = mean(deliveryTime)) %>%
  ungroup() %>%
  summarise(mean_time = mean(deliveryTime),
            mean_orders = mean(volume))

# Dispaly plot comparing # of orders vs avg deliver time

shipments %>%
  filter(!is.na(deliverDate) & delayed == 'On Time') %>% # Remove not delivered orders 
  group_by(courier) %>%
  summarise(volume = n_distinct(X_id),
            deliveryTime = mean(deliveryTime)) %>%
  plot_ly(x = ~deliveryTime,
         y = ~volume,
         text = ~courier,
         color = ~courier) %>%
  layout(title = 'Couriers - Orders vs Deliver Time - Considering only On Time orders',
         showlegend = FALSE,
         shapes = list(list(
                            type = "line", 
                            y0 = 0, 
                            y1 = 1, 
                            yref = "paper",
                            x0 = 3.88, 
                            x1 = 3.88, 
                            line = list(color = "gray")
                          ), list(
                            type = "line", 
                            x0 = 0, 
                            x1 = 1, 
                            xref = "paper",
                            y0 = 110, 
                            y1 = 110, 
                            line = list(color = "gray")
                          )),
         xaxis = list(title = 'Avg. Deliver Time',
                      showgrid = FALSE, 
                      zeroline = FALSE, 
                      showticklabels = FALSE),
         yaxis = list(title = 'Number of orders',
                      showgrid = FALSE, 
                      zeroline = FALSE, 
                      showticklabels = FALSE))

Method’s delivery time vs number of orders

Reproducing the same vision for methods, the result looks the same, since couriers almost always have only one relevant shipping method.

# Found the average number of orders and the average deliver Time per courier 

avg_order <- shipments %>%
  filter(!is.na(deliverDate) & delayed == 'On Time') %>% # Remove not delivered orders  or delayed orders
  group_by(method) %>%
  summarise(volume = n_distinct(X_id),
            deliveryTime = mean(deliveryTime)) %>%
  ungroup() %>%
  summarise(mean_time = mean(deliveryTime),
            mean_orders = mean(volume))

# Dispaly plot comparing # of orders vs avg deliver time

shipments %>%
  filter(!is.na(deliverDate) & delayed == 'On Time') %>% # Remove not delivered orders 
  group_by(method) %>%
  summarise(volume = n_distinct(X_id),
            deliveryTime = mean(deliveryTime)) %>%
  plot_ly(x = ~deliveryTime,
         y = ~volume,
         text = ~method,
         color = ~method) %>%
  layout(title = 'Methods - Orders vs Deliver Time - Considering only On Time orders',
         showlegend = FALSE,
         shapes = list(list(
                            type = "line", 
                            y0 = 0, 
                            y1 = 1, 
                            yref = "paper",
                            x0 = 3.88, 
                            x1 = 3.88, 
                            line = list(color = "gray")
                          ), list(
                            type = "line", 
                            x0 = 0, 
                            x1 = 1, 
                            xref = "paper",
                            y0 = 110, 
                            y1 = 110, 
                            line = list(color = "gray")
                          )),
         xaxis = list(title = 'Avg. Deliver Time',
                      showgrid = FALSE, 
                      zeroline = FALSE, 
                      showticklabels = FALSE),
         yaxis = list(title = 'Number of orders',
                      showgrid = FALSE, 
                      zeroline = FALSE, 
                      showticklabels = FALSE))

Total products price per shipment distribution

First, it is necessary to join all the data tables. It was necessary to unnest the packages column in the shipment data table and unnest the products column in the packages table. Also, i’m removing products that have price equals to 0.

The median order contains 61 euros products inside. However, there are outliers. Some orders contains 1000+ euros products.

# Calculate the products price per order

shipments_aux <- shipments %>%
  separate_rows(packages) %>% 
  filter(!packages %in% c('oid', ':', '')) # Unnest and create one row per package

packages_aux <- packages %>% 
  separate_rows(products) %>%
  filter(!products %in% c('oid', ':', '')) # Unnest and create on row per product

shipments_aux %>%
  inner_join(packages_aux, by = c('packages' = 'X_id')) %>%
  inner_join(products, by = c('products' = 'X_id')) %>%
  filter(!is.na(price),
         price > 0) %>% # remove products without price and that presents price equals to 0
  group_by(X_id) %>%
  summarise(total_price = sum(price)) %>%
  plot_ly(y = ~total_price, type = 'box') %>%
  layout(yaxis = list(title = 'Price per Shipment'),
         xaxis = list(title = '',
                      showgrid = FALSE, 
                      zeroline = FALSE, 
                      showticklabels = FALSE),
         title = 'Products Price per order')